In this tutorial, we apply the CrowdTruth metrics to a multiple-choice crowdsourcing task for Person Type Annotation in video fragments. Workers were asked to watch a short video of about 3-5 seconds and then select from a multiple-choice list the types of person that appear in the video fragment. The task was executed on FigureEight. For more crowdsourcing annotation task examples, click here.
To replicate this experiment, the code used to design and implement the crowdsourcing annotation template is available here: template, css, javascript.
This is a screenshot of the task as it appeared to workers:
A sample dataset for this task is available in this file, containing raw output from the crowd on FigureEight. Download the file and place it in a folder named data that sits next to this notebook's folder (the code below reads it from ../data/). Now you can check your data:
In [1]:
import pandas as pd
test_data = pd.read_csv("../data/person-video-multiple-choice.csv")
test_data.head()
Out[1]:
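As a quick sanity check on the raw file, you can look at its size and the number of distinct workers and units. This sketch assumes the standard FigureEight column names _worker_id and _unit_id; adjust them if your export differs:
print(test_data.shape)
print("workers:", test_data["_worker_id"].nunique())        # assumes FigureEight's _worker_id column
print("video fragments:", test_data["_unit_id"].nunique())  # assumes FigureEight's _unit_id column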
In [2]:
import crowdtruth
from crowdtruth.configuration import DefaultConfig
Our test class inherits the default configuration DefaultConfig, while also declaring some additional attributes that are specific to the Person Type Annotation in Video task:
- inputColumns: list of input columns from the .csv file with the input data
- outputColumns: list of output columns from the .csv file with the answers from the workers
- annotation_separator: string that separates the crowd annotations in outputColumns
- open_ended_task: boolean variable defining whether the task is open-ended (i.e. the possible crowd annotations are not known beforehand, as in the case of free text input); in the task that we are processing, workers pick their answers from a pre-defined list, so the task is not open-ended and this variable is set to False
- annotation_vector: list of possible crowd answers, mandatory to declare when open_ended_task is False; for our task, this is the list of person types
- processJudgments: method that defines the processing of the raw crowd data; for this task, we normalize the crowd answers so that they match the values in annotation_vector
The complete configuration class is declared below:
In [3]:
class TestConfig(DefaultConfig):
    inputColumns = ["videolocation", "subtitles", "imagetags", "subtitletags"]
    outputColumns = ["selected_answer"]

    # processing of a closed task
    open_ended_task = False
    annotation_vector = ["archeologist", "architect", "artist", "astronaut", "athlete", "businessperson", "celebrity",
                         "chef", "criminal", "engineer", "farmer", "fictionalcharacter", "journalist", "judge",
                         "lawyer", "militaryperson", "model", "monarch", "philosopher", "politician", "presenter",
                         "producer", "psychologist", "scientist", "sportsmanager", "writer", "none", "other"]

    def processJudgments(self, judgments):
        # pre-process output to match the values in annotation_vector
        for col in self.outputColumns:
            # transform to lowercase
            judgments[col] = judgments[col].apply(lambda x: str(x).lower())
            # remove square brackets from annotations
            judgments[col] = judgments[col].apply(lambda x: str(x).replace('[', ''))
            judgments[col] = judgments[col].apply(lambda x: str(x).replace(']', ''))
            # remove the quotes around the annotations
            judgments[col] = judgments[col].apply(lambda x: str(x).replace('"', ''))
        return judgments
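For intuition, here is a minimal sketch of what processJudgments does, assuming FigureEight stores a multiple-choice answer as a bracketed, quoted list (the exact raw format may differ). After the cleanup, the annotations are left as plain comma-separated values that the annotation_separator can split:
raw = '["Politician","Presenter"]'  # hypothetical raw value, for illustration only
clean = raw.lower().replace('[', '').replace(']', '').replace('"', '')
print(clean)             # politician,presenter
print(clean.split(','))  # ['politician', 'presenter']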
In [4]:
data, config = crowdtruth.load(
    file="../data/person-video-multiple-choice.csv",
    config=TestConfig()
)

data['judgments'].head()
Out[4]:
The CrowdTruth metrics are computed with crowdtruth.run, which returns a dictionary of dataframes with quality scores for the units (video fragments), workers, and annotations:
In [5]:
results = crowdtruth.run(data, config)
The video fragment metrics are stored in results["units"]. The uqs column contains the video fragment quality scores, capturing the overall worker agreement over each video fragment:
In [6]:
results["units"].head()
Out[6]:
In [7]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(results["units"]["uqs"])
plt.xlabel("Video Fragment Quality Score")
plt.ylabel("Video Fragments")
Out[7]:
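The histogram gives the shape of the distribution; a numeric summary of the same unit quality scores can be read off with a one-liner (a quick sketch):
results["units"]["uqs"].describe()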
The unit_annotation_score
column in results["units"]
contains the video fragment-annotation scores, capturing the likelihood that an annotation is expressed in a video fragment. For each video fragment, we store a dictionary mapping each annotation to its video fragment-annotation score.
In [8]:
results["units"]["unit_annotation_score"].head()
Out[8]:
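Since each entry is a dictionary mapping annotations to scores, the top-scoring annotation for each video fragment can be pulled out directly; a minimal sketch:
results["units"]["unit_annotation_score"].apply(
    lambda scores: max(scores.items(), key=lambda kv: kv[1])
).head()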
A low unit quality score can be used to identify ambiguous video fragments. First, we sort the unit quality metrics stored in results["units"] by the quality score (uqs) in ascending order; the clearest video fragments are thus found at the tail of the new structure:
In [9]:
results["units"].sort_values(by=["uqs"])[["input.videolocation", "uqs", "unit_annotation_score"]].head()
Out[9]:
Below we show an example video fragment with a low quality score, where workers could not agree on which annotation best describes the person in the video. The role of the person is not stated directly, so workers made assumptions based on the topic of discussion.
In [10]:
from IPython.display import HTML

print(results["units"].sort_values(by=["uqs"])[["uqs"]].iloc[0])
print("\n")
print("Person types picked for the video below:")
for k, v in results["units"].sort_values(by=["uqs"])[["unit_annotation_score"]].iloc[0]["unit_annotation_score"].items():
    if v > 0:
        print(str(k) + " : " + str(v))
vid_url = list(results["units"].sort_values(by=["uqs"])[["input.videolocation"]].iloc[0])
HTML("<video width='320' height='240' controls><source src=" + vid_url[0] + " type='video/mp4'></video>")
Out[10]:
Conversely, sorting by uqs in descending order puts the clearest video fragments at the head:
In [11]:
results["units"].sort_values(by=["uqs"], ascending=False)[["input.videolocation", "uqs", "unit_annotation_score"]].head()
Out[11]:
Below we show an example of an unambiguous video fragment: no person appears in the video, so most workers picked the none option in the crowd task.
In [12]:
print(results["units"].sort_values(by=["uqs"], ascending=False)[["uqs"]].iloc[0])
print("\n")
print("Person types picked for the video below:")
for k, v in results["units"].sort_values(by=["uqs"], ascending=False)[["unit_annotation_score"]].iloc[0]["unit_annotation_score"].items():
    if v > 0:
        print(str(k) + " : " + str(v))
vid_url = list(results["units"].sort_values(by=["uqs"], ascending=False)[["input.videolocation"]].iloc[0])
HTML("<video width='320' height='240' controls><source src=" + vid_url[0] + " type='video/mp4'></video>")
Out[12]:
The worker metrics are stored in results["workers"]. The wqs column in results["workers"] contains the worker quality scores, capturing the overall agreement between each worker and all the other workers.
In [13]:
results["workers"].head()
Out[13]:
In [14]:
plt.hist(results["workers"]["wqs"])
plt.xlabel("Worker Quality Score")
plt.ylabel("Workers")
Out[14]:
In [15]:
results["workers"].sort_values(by=["wqs"]).head()
Out[15]:
Example annotations from low-quality worker 44606916 (the worker with the second-lowest quality score) for video fragment 1856509900:
In [16]:
import operator

work_id = results["workers"].sort_values(by=["wqs"]).index[1]
work_units = results["judgments"][results["judgments"]["worker"] == work_id]["unit"]
work_judg = results["judgments"][results["judgments"]["unit"] == work_units.iloc[0]]

print("JUDGMENTS OF LOW QUALITY WORKER %d FOR VIDEO %d:" % (work_id, work_units.iloc[0]))
for k, v in work_judg[work_judg["worker"] == work_id]["output.selected_answer"].iloc[0].items():
    if v > 0:
        print(str(k) + " : " + str(v))

print("\nALL JUDGMENTS FOR VIDEO %d" % work_units.iloc[0])
sorted_judg = sorted(
    results["units"]["output.selected_answer"][work_units.iloc[0]].items(),
    key=operator.itemgetter(1),
    reverse=True)
for k, v in sorted_judg:
    if v > 0:
        print(str(k) + " : " + str(v))

vid_url = results["units"]["input.videolocation"][work_units.iloc[0]]
HTML("<video width='320' height='240' controls><source src=" + str(vid_url) + " type='video/mp4'></video>")
Out[16]:
Example annotations from the same low-quality worker (44606916) for a second video fragment (1856509903):
In [17]:
work_judg = results["judgments"][results["judgments"]["unit"] == work_units.iloc[1]]

print("JUDGMENTS OF LOW QUALITY WORKER %d FOR VIDEO %d:" % (work_id, work_units.iloc[1]))
for k, v in work_judg[work_judg["worker"] == work_id]["output.selected_answer"].iloc[0].items():
    if v > 0:
        print(str(k) + " : " + str(v))

# note: this header must reference the second video fragment (iloc[1]), not iloc[0]
print("\nALL JUDGMENTS FOR VIDEO %d" % work_units.iloc[1])
sorted_judg = sorted(
    results["units"]["output.selected_answer"][work_units.iloc[1]].items(),
    key=operator.itemgetter(1),
    reverse=True)
for k, v in sorted_judg:
    if v > 0:
        print(str(k) + " : " + str(v))

vid_url = results["units"]["input.videolocation"][work_units.iloc[1]]
HTML("<video width='320' height='240' controls><source src=" + str(vid_url) + " type='video/mp4'></video>")
Out[17]:
In [18]:
results["workers"].sort_values(by=["wqs"], ascending=False).head()
Out[18]:
Example annotations from worker 6432269
(with the highest worker quality score) for video fragment 1856509904
:
In [19]:
work_id = results["workers"].sort_values(by=["wqs"], ascending=False).index[0]
work_units = results["judgments"][results["judgments"]["worker"] == work_id]["unit"]
work_judg = results["judgments"][results["judgments"]["unit"] == work_units.iloc[0]]

print("JUDGMENTS OF HIGH QUALITY WORKER %d FOR VIDEO %d:" % (work_id, work_units.iloc[0]))
for k, v in work_judg[work_judg["worker"] == work_id]["output.selected_answer"].iloc[0].items():
    if v > 0:
        print(str(k) + " : " + str(v))

# note: this header must reference the first video fragment (iloc[0]), not iloc[1]
print("\nALL JUDGMENTS FOR VIDEO %d" % work_units.iloc[0])
sorted_judg = sorted(
    results["units"]["output.selected_answer"][work_units.iloc[0]].items(),
    key=operator.itemgetter(1),
    reverse=True)
for k, v in sorted_judg:
    if v > 0:
        print(str(k) + " : " + str(v))

vid_url = results["units"]["input.videolocation"][work_units.iloc[0]]
HTML("<video width='320' height='240' controls><source src=" + str(vid_url) + " type='video/mp4'></video>")
Out[19]:
Example annotations from worker 6432269
(with the highest worker quality score) for video fragment 1856509908
:
In [20]:
work_id = results["workers"].sort_values(by=["wqs"], ascending=False).index[0]
work_units = results["judgments"][results["judgments"]["worker"] == work_id]["unit"]
work_judg = results["judgments"][results["judgments"]["unit"] == work_units.iloc[1]]

print("JUDGMENTS OF HIGH QUALITY WORKER %d FOR VIDEO %d:" % (work_id, work_units.iloc[1]))
for k, v in work_judg[work_judg["worker"] == work_id]["output.selected_answer"].iloc[0].items():
    if v > 0:
        print(str(k) + " : " + str(v))

print("\nALL JUDGMENTS FOR VIDEO %d" % work_units.iloc[1])
sorted_judg = sorted(
    results["units"]["output.selected_answer"][work_units.iloc[1]].items(),
    key=operator.itemgetter(1),
    reverse=True)
for k, v in sorted_judg:
    if v > 0:
        print(str(k) + " : " + str(v))

vid_url = results["units"]["input.videolocation"][work_units.iloc[1]]
HTML("<video width='320' height='240' controls><source src=" + str(vid_url) + " type='video/mp4'></video>")
Out[20]:
In [21]:
plt.scatter(results["workers"]["wqs"], results["workers"]["judgment"])
plt.xlabel("WQS")
plt.ylabel("# Annotations")
Out[21]:
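To put a number on the trend visible in the scatter plot, a rank correlation between worker quality and the number of judgments can be computed; a quick sketch using pandas:
results["workers"]["wqs"].corr(results["workers"]["judgment"], method="spearman")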
The annotation metrics are stored in results["annotations"]. The aqs column contains the annotation quality scores, capturing the overall worker agreement over each annotation.
There is a slight correlation between the number of times an annotation was picked (column output.selected_answer) and its annotation quality score: annotations that are rarely picked (e.g. engineer, farmer) tend to have lower quality scores. These annotations occur less often in the corpus, so the likelihood that they are picked is lower, and when they are picked it is more likely to be a worker mistake. This is not a strict rule, however; some rarely picked annotations (e.g. astronaut) can still have high quality scores.
In [22]:
results["annotations"]["output.selected_answer"] = 0
for idx in results["judgments"].index:
for k,v in results["judgments"]["output.selected_answer"][idx].items():
if v > 0:
results["annotations"].loc[k, "output.selected_answer"] += 1
results["annotations"] = results["annotations"].sort_values(by=["aqs"], ascending=False)
results["annotations"].round(3)[["output.selected_answer", "aqs"]]
Out[22]:
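With the pick counts computed above, the claimed correlation between annotation frequency and quality can also be checked directly; a minimal sketch using pandas:
results["annotations"]["output.selected_answer"].corr(
    results["annotations"]["aqs"], method="spearman")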
Finally, we export the computed metrics to CSV files. The unit metrics are flattened into one row per video fragment, with one column for each video fragment-annotation score (both final and initial):
In [23]:
rows = []
header = ["unit", "videolocation", "subtitles", "imagetags", "subtitletags", "uqs", "uqs_initial"]
annotation_vector = ["archeologist", "architect", "artist", "astronaut", "athlete", "businessperson", "celebrity",
                     "chef", "criminal", "engineer", "farmer", "fictionalcharacter", "journalist", "judge",
                     "lawyer", "militaryperson", "model", "monarch", "philosopher", "politician", "presenter",
                     "producer", "psychologist", "scientist", "sportsmanager", "writer", "none", "other"]
header.extend(annotation_vector)
# column names for the initial (pre-iteration) annotation scores
annotation_vector_in = [annotation + "_initial" for annotation in annotation_vector]
header.extend(annotation_vector_in)

units = results["units"].reset_index()
for i in range(len(units.index)):
    row = [units["unit"].iloc[i], units["input.videolocation"].iloc[i], units["input.subtitles"].iloc[i],
           units["input.imagetags"].iloc[i], units["input.subtitletags"].iloc[i], units["uqs"].iloc[i],
           units["uqs_initial"].iloc[i]]
    for item in annotation_vector:
        row.append(units["unit_annotation_score"].iloc[i][item])
    for item in annotation_vector_in:
        row.append(units["unit_annotation_score_initial"].iloc[i][item])
    rows.append(row)

rows = pd.DataFrame(rows, columns=header)
rows.to_csv("../data/results/multchoice-people-video-units.csv", index=False)
In [24]:
results["workers"].to_csv("../data/results/multchoice-people-video-workers.csv", index=True)
In [25]:
results["annotations"].to_csv("../data/results/multchoice-people-video-annotations.csv", index=True)